---
title: "Loading EU-SILC data processed in Stata"
output:
  html_document: default
date: "May 13, 2020"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(haven) # for loading .dta files
library(tidyverse) # for data manipulation
```

Before knitting, change the path in the `read_dta()` call (line 113) to the personal folder where you placed the code files.

# Data description

The dataset is a `.dta` file exported from Stata. The original Stata notes on its variables are reproduced below.
```
i year country rb020 health married child emp ltdhi urbanisation male ageg educ6

*var1 i       : unique identifier
*var2 year   : year of the interview
*var3 country   : country of the individual, numeric
*var4 rb020   : country of the individual, string
*var5 health   : health information
*var6 married   : married status of the individual
*var7 child   : number of children of the individual
*var8 emp   : employment status of the individual

// =========================================================================
// Variables explanation: description of the details
// =========================================================================

*var1 i       : unique identifier

*Ids are modified relative to original SILC ids as follows:
*egen string c_pid = concat(country k pid), format(%60.0g)
*generate c_pid_n=c_pid
*egen i=group(c_pid_n yearl)

*var2 year   : year of the interview
*2003-2016

*var3 country   : country of the individual, numeric
*var4 rb020   : country of the individual, string
*var5 health   : health information
*genhealthlabel 1 "Very good" 2 "Good" 3 "Fair" 4 "Bad" 5 "Very bad"

*var6 married   : married status of the individual
*marriedlabel 0 "Not married" 1 "Married"

*var7 child   : number of children of the individual
*childlabel 0 "0" 1 "1" 2 "2" 3 ">=3"

*var8 emp   : employment status of the individual
/* original variable
1 Employee working full-time
2 Employee working part-time
3 Self-employed working full-time (including family worker)
4 Self-employed working part-time (including family worker)
5 Unemployed
6 Pupil, student, further training, unpaid work experience
7 In retirement or in early retirement or has given up business
8 Permanently disabled or/and unfit to work
9 In compulsory military or community service
10 Fulfilling domestic tasks and care responsibilities
11 Other inactive person
*/

/*made homogeneous over time using 9 categories

*emp9label 1 "Full-time" 2 "Part-time" ///
3 "Unemployed" 4 "Pupil, student, further training, unpaid work experience" ///
5 "In retirement or in early retirement or has given up business" ///
6 "Permanently disabled or/and unfit to work" ///
7 "In compulsory military or community service" ///
8 "Fulfilling domestic tasks and care responsibilities" ///
9 "Other inactive person" ///
. "missing information employment status"
*/

/*variable emp has 4 consistent categories:
emp4label
1 "Full-time or Part-time" ///
2 "Unemployed"
3 "Pupil, student, further training, unpaid work experience,  ///
In compulsory military or community service. Fulfilling domestic ///
tasks and care responsibilities" ///
4 "In retirement or in early retirement or has given up business ///
or Permanently disabled or/and unfit to work or Other inactive person" ///
. "missing information employment status" */

*var9 male
*malelabel 0 "Female" 1 "Male"

*var10 ageg

*var11 educ6
*la var educ6 "Education (1 lowest, 5 highest)"

la var urbanisation "Urbanisation (1 highest, 3 lowest)"

la var tdhi "Total disp household equivalized income"
```

# Introduction

Load the entire data set.

```{r}
# Change this path to your personal folder where the code files were placed.
df_sample <- read_dta("folder/data/dataset.dta")
```

Look at the data.

```{r}
df_sample
```

- `i`: unique identifier
- `year`: year of the interview
- `country`: country of the individual, numeric
- `rb020`: country of the individual, string
- `health`: health status, 1 (very good) to 5 (very bad)
- `married`: marital status, 0 (not married) or 1 (married)
- `child`: number of children, 0-3 (3 means three or more)
- `emp`: employment status, 1-4
- `ltdhi`: log income (total disposable equivalised household income)
- `urbanisation`: degree of urbanisation, 1 (highest) to 3 (lowest)
- `male`: gender, 0 (female) or 1 (male)
- `ageg`: age categories, 1-6
- `educ6`: education categories, 1 (lowest) to 5 (highest)
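
Several of these variables are categorical codes. When a Stata file carries value labels, `read_dta()` returns such columns as haven's labelled vectors; whether that applies to this dataset is an assumption. The sketch below uses a made-up vector (`toy_health` is hypothetical, with labels copied from the description above) to show two ways of handling such a column.

```{r}
library(haven)

# Made-up labelled vector that mimics the `health` coding (not real data)
toy_health <- labelled(
  c(1, 2, 5, 3),
  labels = c("Very good" = 1, "Good" = 2, "Fair" = 3, "Bad" = 4, "Very bad" = 5)
)

# Option 1: keep the numeric codes and drop the labels
toy_codes <- as.numeric(zap_labels(toy_health))

# Option 2: turn the labels into factor levels
toy_factor <- as_factor(toy_health)

toy_codes
levels(toy_factor)
```

The numeric version is convenient for arithmetic recoding, while the factor version keeps the category names visible in tables.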

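Sort the data by individual and year, create a `period` variable that counts years since each individual's first interview, drop the redundant numeric country identifier (`country`; the string version `rb020` is kept), and convert the coded variables to plain numeric vectors. The `year` variable is still needed here and is dropped in the next step.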

```{r}
df_2 <- df_sample %>% 
  # `country` is the numeric twin of `rb020`, so drop it
  select(-country) %>% 
  # convert the coded variables to plain numeric vectors
  mutate(across(
    c(health, married, child, emp, urbanisation, male, ageg, educ6),
    ~ as.numeric(as.character(.x))
  )) %>% 
  # period = years since the individual's first interview (first year = 1)
  arrange(i, year) %>% 
  group_by(i) %>% 
  mutate(period = year - min(year) + 1) %>% 
  ungroup()

skimr::skim(df_2)
```
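
To see what the `period` formula does, here is a minimal sketch on a made-up panel (`toy_panel` and its values are hypothetical). It applies the same `year - min(year) + 1` rule within each individual.

```{r}
library(dplyr)

# Made-up panel: individual 1 is observed 2005-2007,
# individual 2 is observed in 2010 and 2012 (nothing in 2011)
toy_panel <- tibble(
  i    = c(1, 1, 1, 2, 2),
  year = c(2006, 2005, 2007, 2012, 2010)
)

toy_period <- toy_panel %>%
  arrange(i, year) %>%
  group_by(i) %>%
  mutate(period = year - min(year) + 1) %>%
  ungroup()

toy_period
```

Because `period` is computed from calendar years, a missing interview year shows up as a gap in `period` (individual 2 gets periods 1 and 3) rather than being renumbered.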

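Drop rows with missing values, then keep only the individuals with exactly four remaining observations, so that every individual contributes the same number of rows. Finally, drop the helper count `Ti` and `year` (superseded by `period`), and convert `i` to a factor.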

```{r}
df_bal <- df_2 %>% drop_na() %>%
  group_by(i) %>%
    mutate(Ti = n()) %>%
    ungroup() %>%
  filter(Ti==4) %>%
  select(-c(Ti,year)) %>%
  mutate(i = factor(i))
```
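
The order of operations matters here: missing values are removed before the observations are counted. A minimal sketch on made-up data (`toy_obs` is hypothetical) illustrates the consequence.

```{r}
library(dplyr)
library(tidyr)

# Made-up data: individual 1 has four complete rows, individual 2 has
# only three rows, individual 3 has four rows but one missing value
toy_obs <- tibble(
  i      = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  period = c(1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4),
  health = c(2, 2, 3, 2, 1, 1, 2, 4, NA, 3, 3)
)

toy_bal <- toy_obs %>%
  drop_na() %>%                # remove incomplete rows first
  group_by(i) %>%
  mutate(Ti = n()) %>%         # count the remaining rows per individual
  ungroup() %>%
  filter(Ti == 4) %>%          # keep individuals with exactly four rows
  select(-Ti)

unique(toy_bal$i)
```

Only individual 1 survives. Individual 3 is removed entirely, because a single missing value leaves it with three complete rows.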

Use `skimr` to get a quick overview of the balanced data.

```{r}
skimr::skim(df_bal)
```

Note that the `health` variable is rarely equal to 1. The balanced panel is also exported as a Stata `.dta` file.

```{r}
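# Tabulate the original health coding (1 = "Very good", ..., 5 = "Very bad")
table(df_bal$health)

# Export the balanced panel to Stata; write.dta() expects a plain data frame.
# Note that this export still carries the original health coding.
df_bal <- data.frame(df_bal)
library(foreign)
write.dta(df_bal, "data/balanced-data.dta")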
```

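Reverse the coding of the main dependent variable, `health`, so that higher values mean better health. The original categories 4 ("Bad") and 5 ("Very bad") are merged into the lowest category, which leaves four levels from 1 (bad or very bad) to 4 (very good). `emp` is also converted to a factor.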

```{r}
df_bal <- df_bal %>% 
  mutate(
    health = 5 - health,
    health = ifelse(health == 0,1,health)
  ) %>% 
  mutate(
    emp = factor(emp)
  )
```

Check that the recoding was successful:

```{r}
table(df_bal$health)
```
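
To make the mapping explicit, the sketch below applies the same two recoding steps to each of the five original codes (`health_old` and `health_new` are illustrative names, not part of the pipeline).

```{r}
# Original Stata coding: 1 "Very good" ... 5 "Very bad"
health_old <- 1:5

# Same two steps as in the recoding chunk above
health_new <- 5 - health_old                          # reverse: 1 -> 4, ..., 5 -> 0
health_new <- ifelse(health_new == 0, 1, health_new)  # merge 0 into category 1

data.frame(health_old, health_new)
```

Both original codes 4 and 5 end up in the new category 1.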

Count observations (`n_iy`) and distinct individuals (`n_i`) by country, largest countries first...

```{r}
df_bal %>% group_by(rb020) %>% 
  summarize(n_iy = n(),n_i = n_distinct(i)) %>% arrange(desc(n_i))
```

... and save the data as an `.RData` file for later use, in the `data` folder.

```{r}
save(df_bal,file = "data/balanced-data.RData")
```
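
As a quick sanity check on the by-country summary above, the sketch below runs the same `summarize()` call on made-up data (`toy_country` is hypothetical). It assumes that each individual stays in one country, in which case `n_iy` should equal four times `n_i` in a panel with four observations per individual.

```{r}
library(dplyr)

# Made-up balanced panel: one individual in "AT", two in "BE",
# each observed four times
toy_country <- tibble(
  rb020 = c(rep("AT", 4), rep("BE", 8)),
  i     = factor(c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3))
)

toy_counts <- toy_country %>%
  group_by(rb020) %>%
  summarize(n_iy = n(), n_i = n_distinct(i)) %>%
  arrange(desc(n_i))

toy_counts
```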
